Abstract


We discuss the real-time scoring logic for a self-administered 
oral reading assessment on mobile devices (Moby.Read) to 
measure the three components of children’s oral reading fluency 
skills: words correct per minute, expression, and comprehen- 
sion. Critical techniques that make the assessment real-time on- 
device are discussed in detail. We propose producing comprehension
scores by measuring the semantic similarity between the prompt and
the retelling response, drawing on recent advances in document
embeddings in natural language processing. By combining features
derived from word embeddings with the normalized number of common
types, we achieved a human-machine correlation coefficient of 0.90
at the participant level for comprehension scores, which was better
than the human inter-rater correlation of 0.88. We also achieved a
better human-machine correlation coefficient than the human
inter-rater correlation for expression scores. Experimental results demonstrate
that Moby.Read can provide highly accurate words correct per 
minute, expression and comprehension scores in real-time, and 
validate the use of machine scoring methods to automatically 
measure oral reading fluency skills. 

Index Terms: assessment, oral reading fluency, literacy, ex- 
pression, comprehension 


1. Introduction 


Moby.Read is a new, self-administered, fully automated oral 
reading fluency assessment developed for K-5 students [1, 2, 3]. 
The prototype system was built on an iPad mini 4 as a stand- 
alone app. In each test session, students are asked to read a 
word list, read an easy practice passage, and read three grade- 
level passages. After reading one passage aloud, students are 
asked to retell the passage in their own words, put in all the de- 
tails they can remember, then answer two short questions aloud. 

Fluency is the ability to “read text with speed, accuracy, and 
proper expression” [4]. In this paper, we focused our automatic
scoring logic on passage reading (PRead) to produce Words
Correct Per Minute (WCPM) and reading expression scores, and
on passage retelling (PRetell) to produce reading comprehension
scores. WCPM is a score based on the number of words
read correctly in a minute of reading, an informative measure
of oral reading fluency [5]. Expression is the degree to which a
student can clearly express the meaning and structure of the text
through appropriate intonation, rhythm, phrasing, and emphasis
that enhance understanding and enjoyment in a listener
[6]. Comprehension is the degree to which a student can retell major
and minor concepts, themes, and facts from the original passage.
Scoring expression and comprehension emphasizes reading
for meaning instead of reading for speed. Automatic scoring
can reduce the need for teacher training and help ensure consis- 
tency. 

Scores are produced in real time on the device. This has two
advantages: scores and feedback can be provided immediately, and
appropriate reading materials can be selected adaptively, based on
the past or real-time performance of a specific student, to level
the student more accurately.


2. The mobile speech recognition system 


Although automatic speech recognition (ASR) is a compute-intensive
process, the feasibility of on-device speech recognition has been
studied [7, 8, 9, 10, 11]. With the recent introduction of the
neural processing unit (NPU) or neural engine to mobile devices,
we expect that complex acoustic models and language models can be
implemented on the latest mobile devices to achieve better
performance. On-device speech recognition should no longer be a
barrier for a wide class of mobile applications. In the
following subsections, we introduce our system.


2.1. Acoustic models 


The acoustic model used for speech recognition on-device is a 
Deep Neural Network - Hidden Markov Model (DNN-HMM) 
[12] that contains 4 hidden layers and 300 p-norm (p = 2) non- 
linearity neurons with a group size G = 10 [12] per hidden layer, 
trained using all of Librispeech’s training sets [13]: 960 hours 
of clean native (L1) reading data. The sample rate for on-device
speech recognition is 8,000 Hz. The inputs of the DNN
are 40-dimensional log mel-filterbank energies calculated on a
25 ms window every 10 ms, and the output dimension is 2,064
context-dependent triphone states. Both the left and right
contexts are 6.
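
As a rough illustration of the front end described above, the sketch below counts the analysis frames in one second of 8 kHz audio and splices each 40-dimensional frame with its neighbours. It assumes that a context of 6 means six frames on each side and that edge frames are repeated at utterance boundaries. Both are assumptions made for illustration, and the function names are hypothetical rather than part of the actual engine.

```python
def num_frames(num_samples, sample_rate=8000, win_ms=25, shift_ms=10):
    """Number of full analysis windows that fit in the signal."""
    win = sample_rate * win_ms // 1000      # 200 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000  # 80 samples at 8 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift


def splice(frames, left=6, right=6):
    """Concatenate each frame with its neighbours, repeating edge frames."""
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        row = []
        for k in range(t - left, t + right + 1):
            # Clamp the index so the first/last frame is reused at the edges.
            row.extend(frames[min(max(k, 0), last)])
        spliced.append(row)
    return spliced


one_second = num_frames(8000)                     # frames in 1 s of audio
feats = [[0.0] * 40 for _ in range(one_second)]   # placeholder filterbank frames
inputs = splice(feats)                            # one spliced row per frame
```

Under these assumptions, one second of audio yields 98 frames, and each DNN input has 13 x 40 = 520 dimensions.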

There are several model mismatch issues that may degrade
the ASR performance: 1) an adult acoustic model was used
to recognize children’s speech; 2) narrowband audio was used; 3) the
acoustic model was trained using very clean/quiet recordings,
so the ASR accuracy may diminish with very noisy data. Despite
these issues, the overall on-device acoustic model performance
is good since we deal with very low perplexity situations
with suitable language models.

In our previous work [14, p. 24], we studied the model mismatch
effect, for example by checking performance on a child test set
when adult acoustic models are used. We concluded that
DNNs seem to be good at learning invariant representations of
speech signals, and adult data could be more suitable for learning
speech representations. When mismatched adult acoustic models are
used, the performance loss on constrained item types is
not severe. Still, after we collect enough child responses and
transcriptions, we plan to train a better acoustic model by
combining Librispeech’s data with children’s speech data.
Domain-specific training data always helps [14].


2.2. Language models 


Item-specific rule-based language models (RBLMs) [15, 16] are
built for PRead. No data from this study was used to tune the
RBLMs. Item-specific 3-gram language models were built for
PRetell, using all the human transcriptions we have for the
specified item: around 59 transcriptions per item, with an average
vocabulary size of 230. The comprehension scores reported in this
paper are biased by the fact that we used the same data to train
the language models. We assume that this bias is small since
we deal with a very narrow domain.


2.2.1. Advantages of item-specific language models 


Given the constraints that decoding must finish in real time and
that the decoding devices are mobile, ASR accuracy decreases
significantly when the language model is big. We deal
with a narrow domain with expected answers. Building a small,
constrained language model helps to decode quickly and
achieve good accuracy. It helps to overcome the challenge that
children’s speech contains larger acoustic variability because of
variable vocal tract length and formant frequencies [17],
as well as some other model mismatch challenges. It also gives the
benefit of the doubt to accented or non-native speech. Many spoken
assessment applications fall into the category of a narrow
domain with expected answers. Item-specific language models
with smaller vocabulary sizes are preferred, and are often used
in practice for spoken assessment applications [14].


2.2.2. The rule-based language model 


For each expected passage, sentence, phrase or token sequence,
a simple directed graph is built that has a path from the first word
in the sequence to the last word [15, 16]. Different directed arcs
with probabilities are added to represent different classes of
changes made by subjects, such as skipping, repeating, inserting,
and substituting. Adding a back-off arc allows domain
words to be spoken in any order. Both the changes and the
probabilities can be learned from data, and using domain data helps
to build better language models. The graphs generated from
several different expected answers can be combined
with their expected probabilities to form the final RBLM. Naturally,
RBLMs give the expected answer sequences higher probabilities
and the less likely orders lower probabilities. RBLMs can be
compiled on-line, which gives us the flexibility to recognize any
content that is generated dynamically. Humans can also add
arbitrary reasonable rules to be used by RBLMs directly.
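
The construction can be sketched as follows. This is a minimal illustration, not the RBLM implementation of [15, 16]: the arc classes are limited to reading, skipping, and repeating, and the probabilities `p_skip` and `p_repeat` are hypothetical placeholders rather than values learned from data. Insertion, substitution, and back-off arcs would be added in the same way.

```python
def build_rblm(words, p_skip=0.05, p_repeat=0.05):
    """Return arcs (src, dst, label, prob) for one expected word sequence.

    State i means "the first i words of the sequence have been consumed".
    """
    arcs = []
    for i, word in enumerate(words):
        can_repeat = i > 0
        # Whatever mass is not given to a change class stays on the expected path.
        p_read = 1.0 - p_skip - (p_repeat if can_repeat else 0.0)
        arcs.append((i, i + 1, word, p_read))       # read the expected word
        arcs.append((i, i + 1, "<eps>", p_skip))    # skip the expected word
        if can_repeat:
            # Self-loop: say the previous word again without advancing.
            arcs.append((i, i, words[i - 1], p_repeat))
    return arcs


arcs = build_rblm("the dog ran home".split())
```

The outgoing probabilities of every non-final state sum to one, and the expected path keeps most of the probability mass, which matches the qualitative behaviour described above.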


2.3. The ASR decoder 


The decoding engine [18] is based on KALDI [19]. Modifications
were made to utilize mobile devices’ single instruction
multiple data (SIMD) and digital signal processor (DSP) frameworks.
Supporting utilities were built to convert RBLMs to
finite state transducers (FSTs) for decoding. Once recording of a
response starts, the engine decodes progressively every 0.128
seconds. The decoding real-time factor is around 0.2 on an
iPad Mini 4. We chose the acoustic scale so that insertions and
deletions are balanced to avoid ignoring the speech signal.
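
For concreteness, the timing figures above imply the following per-chunk budget. The arithmetic assumes the 8,000 sample rate given in Subsection 2.1 and treats the real-time factor as decoding time divided by audio duration, which is the usual definition. The variable and function names are illustrative.

```python
SAMPLE_RATE_HZ = 8000      # narrowband input, from Subsection 2.1
CHUNK_SEC = 0.128          # the engine decodes progressively at this interval
REAL_TIME_FACTOR = 0.2     # approximate value reported for the iPad Mini 4


def real_time_factor(decode_sec, audio_sec):
    """Decoding time divided by the duration of the audio decoded."""
    return decode_sec / audio_sec


chunk_samples = round(SAMPLE_RATE_HZ * CHUNK_SEC)   # samples per decoding step
decode_budget_sec = REAL_TIME_FACTOR * CHUNK_SEC    # time spent decoding each chunk
```

Each decoding step therefore covers 1,024 samples and, at a real-time factor of about 0.2, takes roughly 26 ms of the 128 ms available.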


2.4. An ASR performance comparison 


Although Google cloud speech API [20] (GSpeech) can be used
off-the-shelf without any additional modifications, its word
error rates (WERs) are rather high on our children’s reading and
retelling tasks. The main reason is that GSpeech is designed to
recognize general English with a broad language model;
its purpose is too general to perform well in this narrow domain.
For the 282 PRead responses, GSpeech achieved a WER of 34.8%
(n=23,736) and our on-device ASR engine achieved a WER of
10.7%. For the 282 PRetell responses, GSpeech achieved a WER of
32.9% (n=11,070) and our on-device ASR engine achieved a
WER of 16.3%. Our server-side ASR engine, which used broadband
speech, featured more elaborate acoustic models, and utilized a
larger beam value, can achieve a WER of 9.6% for the 282 PRetell
responses.
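
For reference, WER figures of this kind are conventionally computed as the word-level edit distance between the reference transcription and the ASR output, divided by the number of reference words. The sketch below follows that convention and pools errors over all responses. It assumes that n above counts reference words, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Pooled WER over (reference, hypothesis) token-list pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    ref_words = sum(len(ref) for ref, _ in pairs)
    return errors / ref_words


pairs = [
    ("the cat sat on the mat".split(), "the cat sat the mat".split()),
    ("a dog ran home".split(), "a dog ran home".split()),
]
wer = corpus_wer(pairs)   # one deletion over ten reference words
```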


3. Machine scoring methods


3.1. Words correct per minute


The number of correct words is derived by aligning the ASR result
with the prompt using edit distance. Insertions caused by
disfluencies are ignored. The time duration runs from the beginning
of the first correct word to the end of the last correct word
according to the ASR result. The session-level WCPM is the
median WCPM value from the three passages, a widely used
procedure in measuring oral reading fluency [21].
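
The procedure above can be sketched as follows. As a stand-in for the edit-distance alignment, this sketch uses the matching blocks from Python’s `difflib.SequenceMatcher`, which is not necessarily the alignment the engine uses. The token format `(word, start_sec, end_sec)` and the function name are assumptions made for illustration.

```python
import difflib
import statistics


def wcpm(prompt_words, asr_tokens):
    """WCPM for one passage; asr_tokens are (word, start_sec, end_sec) tuples."""
    hyp_words = [word for word, _, _ in asr_tokens]
    matcher = difflib.SequenceMatcher(None, prompt_words, hyp_words, autojunk=False)
    # Indices of ASR tokens that line up with a prompt word. Inserted words
    # (for example disfluencies) never match and are therefore ignored.
    matched = [j + k
               for _, j, size in matcher.get_matching_blocks()
               for k in range(size)]
    if not matched:
        return 0.0
    start = asr_tokens[matched[0]][1]    # beginning of the first correct word
    end = asr_tokens[matched[-1]][2]     # end of the last correct word
    if end <= start:
        return 0.0
    return len(matched) / ((end - start) / 60.0)


prompt = "the quick brown fox jumps".split()
tokens = [("the", 0.5, 0.8), ("quick", 0.8, 1.2), ("um", 1.2, 1.5),
          ("brown", 1.5, 2.0), ("fox", 2.0, 2.5)]
passage_wcpm = wcpm(prompt, tokens)

# Session-level score: the median over the three passages.
session_wcpm = statistics.median([90.0, passage_wcpm, 100.0])
```

In the example, four prompt words are read correctly within two seconds, which corresponds to 120 words per minute. The filler "um" is not counted.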


3.2. Expression 


Although our engine generates many features, only relevant
and normalized ones were used (Table 1). These features do not
depend on the length of the material produced. The difference
between log_seg_prob and nlog_seg_prob, and likewise between
iw_log_seg_prob and niw_log_seg_prob, is that for the latter in
each pair: a) when we built the
native duration statistics, the durations were normalized by the
articulation rate of each response; b) when we computed segmental
probabilities, the durations of segments were normalized
by the articulation rate of the corresponding response. This
normalization helps remove the effect of speaking rate, which usually
has a strong correlation with human expression scores.
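
The two rate features and the duration normalization can be illustrated with a small sketch. It assumes a time-ordered list of `(label, duration_sec)` segments in which the label `"sil"` marks silence, and it assumes that normalizing a duration by the articulation rate means multiplying the two, so that durations are expressed relative to the response’s average phoneme length. The exact normalization used by the engine may differ.

```python
def rate_features(segments):
    """Return (art, ros, normalized phoneme durations) for one response."""
    phone_idx = [i for i, (label, _) in enumerate(segments) if label != "sil"]
    if not phone_idx:
        return None
    # Keep the span from the first to the last phoneme, which drops
    # leading and trailing silence but keeps inter-word pauses.
    span = segments[phone_idx[0]:phone_idx[-1] + 1]
    n_phones = len(phone_idx)
    speech = sum(d for label, d in span if label != "sil")
    pauses = sum(d for label, d in span if label == "sil")
    art = n_phones / speech               # phonemes per second of speech
    ros = n_phones / (speech + pauses)    # phonemes per second incl. pauses
    normalized = [d * art for label, d in span if label != "sil"]
    return art, ros, normalized


segments = [("sil", 0.5), ("dh", 0.05), ("ah", 0.05), ("sil", 0.2),
            ("k", 0.1), ("ae", 0.1), ("t", 0.1), ("sil", 0.3)]
art, ros, normalized = rate_features(segments)
```

With this form of normalization the normalized phoneme durations of any response average to one, so fast and slow readers are placed on a common scale before duration statistics are compared.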


Table 1: The features used to predict expression scores.


feature            description
log_leading_sil    Log of the leading silence duration.
art                Articulation rate: the number of phonemes per second
                   of speech.
ros                Rate of speech: the number of phonemes per second of
                   speech and inter word pauses.
log_seg_prob       The averaged log likelihood segmental probability for
                   phonemes [22] based on Librispeech native statistics.
iw_log_seg_prob    The averaged log likelihood segmental probability for
                   inter word silences [22] based on Librispeech native
                   statistics.
nlog_seg_prob      The normalized version of log_seg_prob.
niw_log_seg_prob   The normalized version of iw_log_seg_prob.
amloglike          The acoustic model log likelihood of the recognized
                   result normalized by the total number of frames.
lmloglike          The language model log likelihood of the recognized
                   result normalized by the total number of frames.


3.3. Comprehension 


In natural language processing (NLP), it is becoming popular
to use neural-network-based unsupervised learning algorithms
to represent variable-length pieces of text, such as words,
sentences, passages, and documents, as fixed-length real-valued
feature representations that encode the meaning of the text. For these
methods (e.g. word2vec [23] or doc2vec [24]), the training
objective is usually to learn word/document vector representations
that predict nearby words with higher probability. As a consequence,
in the trained continuous vector space, semantically similar words
or documents are mapped to similar positions. Meaningful results
(e.g. king - man + woman ≈ queen) can be obtained by adding and
subtracting these vectors. These methods achieved better performance in
many NLP tasks [23, 24].

We seek semantic similarity measurements between the
prompt passages and the retelling responses that are automated
and objective. The vector representations of documents could
be a good fit. Compared with previous work [25, 26], our task
could be easier to handle since the domain has been constrained
by the prompts.

The number of words spoken or the number of different
words used could be a good indicator of the similarity if the
subjects act in good faith, although such nonlinguistic surface
features are superficial. We are more interested in measurements
that check semantic similarity directly and do not
have strong correlations with these surface features. The number
of common spoken tokens or the number of common spoken
types could be a good semantic similarity indicator, but it could
fail when a subject uses semantically similar words or phrases
that are not in the prompt. Using word2vec or doc2vec may fix
some of these issues. Assuming that every response can be converted
to a vector that represents the whole content of the response, the
semantic similarity between two documents may be computed from
the distance or similarity between the two vectors. As a result,
we proposed the features w2v_ed, d2v_cos, and LSI_cos. All potentially
useful similarity metrics for the comprehension scores we are
interested in are listed in Table 2.

As words can be represented by real-valued vectors, we
may use the centroid of the word vectors of a text to represent
the text. It usually makes sense to remove stop words before
computing the centroid, and we did observe a performance gain
by doing so. Either cosine similarity or Euclidean
distance between two vectors can serve as a measure of the
similarity between two texts. For w2v, we observed a significant
performance gain by using Euclidean distance. We used the
simple average of the word vectors of the text as the centroid;
we did not observe any performance gain
when using a TF-IDF weighted average of word vectors. Furthermore,
different statistical functions may be used to aggregate the
word vectors to represent the document. We concatenated 4
statistical vectors (mean, minimum, maximum, median) to
form a 4 * 300 = 1200 dimensional vector for a document, which
produced better results. We hypothesize that the distribution
of word embedding vectors plays an important role in representing
the document, and that the statistical vectors may capture some
properties of that distribution.
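
The statistical-vector representation and the resulting distance can be sketched as follows. The three-dimensional embeddings and the stop-word list are toy values invented for illustration. In the actual system the vectors are 300-dimensional, so the concatenated representation has 1200 dimensions. The function names are hypothetical.

```python
import math
import statistics

# Toy embeddings and stop words, made up for this example only.
EMB = {
    "dog": [1.0, 0.0, 0.0], "puppy": [0.9, 0.1, 0.0],
    "ran": [0.0, 1.0, 0.0], "sprinted": [0.1, 0.9, 0.0],
    "car": [0.0, 0.0, 1.0], "parked": [0.0, 0.2, 0.9],
}
STOP_WORDS = {"the", "a"}


def stat_vector(text):
    """Concatenate per-dimension mean, min, max and median of the word vectors."""
    vectors = [EMB[w] for w in text.lower().split()
               if w not in STOP_WORDS and w in EMB]
    dims = list(zip(*vectors))   # one tuple of values per embedding dimension
    return ([statistics.fmean(d) for d in dims] + [min(d) for d in dims]
            + [max(d) for d in dims] + [statistics.median(d) for d in dims])


def w2v_ed(prompt, response):
    """Euclidean distance between the two statistical vectors."""
    return math.dist(stat_vector(prompt), stat_vector(response))


close = w2v_ed("the dog ran", "a puppy sprinted")   # paraphrase, no shared words
far = w2v_ed("the dog ran", "the car parked")       # unrelated content
```

In this toy example the paraphrase shares no content words with the prompt yet lies much closer to it than the unrelated sentence, which is the behaviour that a word-overlap count alone cannot capture.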

We used Google’s pre-trained word2vec vectors, which were
trained on part of the Google News dataset (about 100 billion
words). The archive is available online as GoogleNews-vectors- 
negative300.bin.gz [27]. The model contains 300-dimensional 
vectors for 3 million words and phrases. 


Table 2: Some potentially useful similarity metrics between
prompts and responses for comprehension scores.


feature     description
ntokens_n   The number of words spoken in the response, normalized
            by the number of words in the prompt.
ntypes_n    The number of different words spoken in the response,
            normalized by the number of different words in the prompt.
            It is a measure of the vocabulary size in the response.
nctypes_n   The number of words shared by the prompt and the response,
            normalized by the number of different words in the prompt.
            It is a measure of the overlapped vocabulary size between
            the prompt and the response.
w2v_ed      Euclidean distance between two documents’ statistical vector
            representations that are derived from word embeddings after
            removing stop words.
d2v_cos     Cosine similarity between two documents’ vector
            representations based on doc2vec [24].
wmd         Word mover’s distance [28] between two documents based on
            word embeddings after removing stop words.
LSI_cos     Cosine similarity between two documents’ vector
            representations derived from Latent Semantic Indexing based
            on the term vector model [29].
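
The three surface features in Table 2 can be computed directly from token counts, as in the sketch below. Lower-casing and whitespace tokenization are simplifying assumptions, and the function name is illustrative.

```python
def surface_features(prompt, response):
    """Normalized token, type and common-type counts for one response."""
    prompt_tokens = prompt.lower().split()
    response_tokens = response.lower().split()
    prompt_types = set(prompt_tokens)
    response_types = set(response_tokens)
    return {
        # Words spoken, relative to the prompt length.
        "ntokens_n": len(response_tokens) / len(prompt_tokens),
        # Different words spoken, relative to the prompt vocabulary.
        "ntypes_n": len(response_types) / len(prompt_types),
        # Different words shared with the prompt, relative to its vocabulary.
        "nctypes_n": len(prompt_types & response_types) / len(prompt_types),
    }


features = surface_features("the dog ran to the park", "the dog ran ran home")
```

Here the prompt has 6 tokens and 5 types, the response has 5 tokens and 4 types, and 3 types are shared, giving values of 5/6, 0.8 and 0.6 respectively.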


4. Experimental results


A preliminary study of the Moby.Read system was conducted 
with a sample of 99 children in grades 2-4 from four different 
elementary schools [1, 2]. Recordings of PRead and PRetell 
of three grade-level unpracticed passages from the preliminary 
study were used. Five children who barely produced meaningful
responses (silence or inaudible speech) were excluded from this study.
The total number of subjects in this study is 94. The details
about the raters’ qualifications, training procedures, rubrics and 
how to derive human WCPM, expression and comprehension 
scores can be found in [2]. 

The results reported were produced on a simulation plat- 
form that mimicked the conditions as if it were run on mobile 
devices. The real-time turnaround was verified on mobile de- 
vices. The system was published on Apple App Store [3]. 


4.1. Words correct per minute scores


A scatter plot of human versus machine WCPM is
shown in Figure 1. The correlation between two expert raters
is 0.991. This confirms a known conclusion: machines can
produce reliable WCPM scores for oral reading fluency [16, 30, 31].


[Figure 1 plot: ASR-WCPM versus Human-WCPM with fitted line; n = 94, r = 0.970]


Figure 1: Session-level scatter of median WCPM.


We listened to the 6 recordings of the two outliers. Both children
spoke quite softly and one mumbled the readings for certain
periods. The resulting low signal-to-noise ratio (SNR)
makes parts of the signal difficult to recognize. Google
cloud speech API recognized only 25.2% of the words correctly and
ignored most of the signal for these 6 recordings. The same
kinds of outliers and issues for children were identified before [32].
Addressing low SNR effectively through instructions, e.g. avoiding
high background noise and low speech volume, is the key solution
we are looking for. “Be in a QUIET place” is on the sign-in
page of the app. Speaking clearly instead of mumbling, so that
others can hear, is the requirement. Reliable scores depend on
audible speech.


4.2. Oral reading expression scores 


Every recording of PRead was rated by 3 different human raters
for its ‘Oral Reading Expression’ score on a scale with 6 categories,
with 5 representing the best rating and 0 representing silence or
irrelevant or completely unintelligible material. The rating
distribution is: 0, n=5; 1, n=37; 2, n=95; 3, n=263; 4, n=304; 5, n=142.
The average correlation between each human rater and the
average of the other raters at the response level is 0.795. For the
3 pairs of raters, the average inter-rater correlation at the
response level was 0.740.
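
The two agreement statistics above can be made concrete with a short sketch. It assumes a complete rating matrix (every rater scores every response) and uses made-up ratings. The function names are illustrative.

```python
import itertools
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rater_vs_others(ratings):
    """Average correlation of each rater with the mean of the other raters."""
    scores = []
    for r, own in enumerate(ratings):
        others = [row for k, row in enumerate(ratings) if k != r]
        others_mean = [sum(col) / len(col) for col in zip(*others)]
        scores.append(pearson(own, others_mean))
    return sum(scores) / len(scores)


def mean_pairwise(ratings):
    """Average correlation over all pairs of raters."""
    pairs = list(itertools.combinations(ratings, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy ratings: one row per rater, one column per response.
ratings = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]]
vs_others = rater_vs_others(ratings)
pairwise = mean_pairwise(ratings)
```

On these toy ratings the rater-versus-others figure is higher than the pairwise figure, since averaging several raters reduces the noise in the comparison target. The two figures reported above show the same ordering.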

The correlations among the features discussed in Subsection 3.2
and human ratings at the response level are shown in
Table 3. The speech rate features have the highest correlations
with human ratings. Putting more weight on the features
normalized by speaking rate may downplay the role of rate.


Table 3: Feature cross correlations for expression scores.


                        hv      1      2      3      4      5      6      7      8
1:log_leading_sil    -0.18
2:ros                 0.82  -0.22
3:art                 0.77  -0.20   0.92
4:log_seg_prob        0.81  -0.19   0.90   0.92
5:iw_log_seg_prob     0.50  -0.17   0.62   0.38   0.45
6:nlog_seg_prob       0.60  -0.11   0.65   0.69   0.65   0.40
7:niw_log_seg_prob    0.31  -0.12   0.39   0.14   0.23   0.87   0.27
8:amloglike           0.59   0.10   0.61   0.49   0.58   0.40   0.60   0.28
9:lmloglike          -0.57   0.25   0.82   0.69  -0.68  -0.70  -0.62  -0.53  -0.55


The final session-level expression is an average of individ- 
ual expression scores. If a response doesn’t have enough infor- 
mation to generate an expression score, it will be ignored when 
computing the final session-level expression score. 

Using a neural network model with 10-fold cross-validation, we
achieved a correlation of 0.856 at the response level and 0.902 at
the session level. This is better than a linear regression model,
which achieved 0.840 and 0.887 respectively. We
made sure that no subject appears in more than one fold.
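
One simple way to obtain subject-disjoint folds is to assign whole subjects, rather than individual responses, to folds. The sketch below does this round-robin. It illustrates the constraint only and is not necessarily how the folds were drawn in this study. The function name is illustrative.

```python
def subject_folds(subject_ids, n_folds=10):
    """Return (train_indices, test_indices) pairs with no shared subjects."""
    subjects = sorted(set(subject_ids))
    # Every response of a subject inherits that subject's fold.
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = []
    for k in range(n_folds):
        test = [i for i, s in enumerate(subject_ids) if fold_of[s] == k]
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != k]
        folds.append((train, test))
    return folds


# 94 subjects with three passage readings each, as in this study.
subject_ids = [subject for subject in range(94) for _ in range(3)]
folds = subject_folds(subject_ids)
```

Because all three responses of a child fall in the same fold, a model is never evaluated on a speaker it has seen during training.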


4.3. Comprehension scores 


Every recording of PRetell was rated by at least 4 different human
raters for its ‘Retelling Comprehension’ score on a scale with 7
categories, with 6 representing the best rating and 0 representing
silence or irrelevant or completely unintelligible material. On
average, there are 4.5 ratings per response. The rating
distribution is: 0, n=25; 1, n=128; 2, n=173; 3, n=288; 4, n=266;
5, n=203; 6, n=179. The average correlation between each human
rater and the average of the other raters at the response
level is 0.842. For the 11 pairs of raters who have more than
100 common ratings, the average inter-rater correlation
at the response level was 0.786.


Table 4: Feature cross correlations for comprehension scores.


                  hv      1      2      3      4      5      6
1:ntokens_n     0.82
2:ntypes_n      0.84   0.95
3:nctypes_n     0.87   0.85   0.90
4:w2v_ed       -0.83  -0.78  -0.84  -0.84
5:wmd          -0.85  -0.75  -0.79  -0.92   0.88
6:d2v_cos       0.71   0.61   0.60   0.73  -0.68  -0.79
7:LSI_cos       0.65   0.52   0.54   0.72  -0.65  -0.83   0.83


We calculated the cross correlations (Table 4) at the re- 
sponse level among features we discussed in Subsection 3.3 
and human ratings. These features were extracted using hu- 
man transcriptions. The performances of d2v_cos and LSI_cos 
depend on the training settings: e.g. the training corpus, ran- 
dom seeds and setting parameters. In Table 4 we reported the 
best results we achieved for d2v_cos and LSI_cos from several 
trials. Because of the limited domain data, the potential
overfitting, and weaker correlations compared with the other
features, we did not explore them further. None of the other
results involved overfitting.

It can be seen that the normalized number of different words
spoken in the response is a good indicator of comprehension.
When the subject acts in good faith (which is almost always the
case for K-5 children), this makes sense since comprehension
will depend on the complexity of the material produced. By the
nature of PRetell, a lot of term overlap is expected. The table
shows that counting only the words that appear in the prompt
improves the performance significantly (0.87 for nctypes_n versus
0.84 for ntypes_n). Among the features that utilize
word embedding similarities by considering and weighting
semantically similar words that are not in the prompt and
are ignored by nctypes_n, w2v_ed and wmd are good ones.
There is no setting parameter for the features nctypes_n,
w2v_ed, and wmd. We noticed that wmd has a strong correlation
with nctypes_n. At the same time, wmd uses the same word
embeddings as w2v_ed. In that sense, w2v_ed could be the better
choice to enhance the final performance. We scaled nctypes_n and
w2v_ed to the range [0, 1] and then used their simple average as
our final metric, achieving r=0.888 at the response level.
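
A minimal sketch of this combination is given below. Two details are assumptions rather than statements about the actual system: the scaling is taken to be min-max scaling over the set of responses, and the scaled distance is flipped (1 minus the value) so that both features increase with similarity before they are averaged. The function names are illustrative.

```python
def min_max(values):
    """Scale a list of values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combined_metric(nctypes_n, w2v_ed):
    """Average of scaled vocabulary overlap and scaled embedding closeness."""
    overlap = min_max(nctypes_n)
    # w2v_ed is a distance, so a small value means a similar retelling.
    closeness = [1.0 - d for d in min_max(w2v_ed)]
    return [(o + c) / 2.0 for o, c in zip(overlap, closeness)]


# Toy feature values for three responses.
scores = combined_metric([0.2, 0.5, 0.8], [3.0, 2.0, 1.0])
```

In the toy example the response with the largest overlap and the smallest distance receives the highest combined score.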

All the comprehension performance discussed so far is
based on human transcriptions. In the real application, we used
the ASR recognition results for the computation, which drags
down the performance a little. Following the same procedures
but using the ASR transcriptions, we achieved
r=0.903 at the session level (Figure 2). After collecting enough
data, a more complex supervised machine learning model that
combines the different features listed in Table 2
may further improve the final performance.


[Figure 2 plot: Machine-Comprehension versus Human-Comprehension with fitted line; n = 94, r = 0.903]


Figure 2: Session-level scatter of comprehension.


5. Conclusions 


We built an oral reading assessment system on mobile devices 
that delivers reliable WCPM, expression, and comprehension 
scores in real-time for first-language learners in grades 2-4 [3]. 
Our RBLMs relieve the requirement of field data collection for 
new reading passages to produce WCPM and expression scores; 
however, data collection is still required for passage retellings 
in order to build suitable language models to achieve the best 
performance. The proposed idea of producing comprehension 
scores by measuring the semantic similarity between the prompt 
passage and the retelling response utilizing the document em- 
beddings works well. For both expression and comprehension 
scores, the human-machine correlations are better than the hu- 
man inter-rater ones, which validates the effectiveness of the 
system. The findings support the use of machine scoring meth- 
ods to measure oral reading fluency skills automatically. 

We expect the system can be highly useful beyond the ap- 
plication discussed here, such as in second-language learning 
for adults as well as children. Assessing in real-time means the 
system can rapidly adapt to a learner’s performance, which can 
be used by learning systems to condition immediate, personal- 
ized feedback and select the next challenge within a session. 